Method for detecting emotions involving subspace specialists

ABSTRACT

To detect and determine a current emotional state (CES) of a human being from a spoken speech input (SI), a method for detecting emotions is suggested in which first and second feature classes (A, E) are identified with, in particular distinct, dimensions of an underlying emotional manifold (EM) or emotional space (ES) and/or with subspaces thereof.

[0001] The present invention relates to a method for detecting emotions and in particular to a method for detecting emotions from speech input by involving so-called subspace specialists.

[0002] In many applications, it is desired to detect the current emotional state of a human speaker, e.g. of a user of a piece of equipment, or the like. Many methods for detecting the emotional state of a human being have been described. Many of these known methods employ and evaluate different sources of features: visual sources, acoustical sources, and other physiological sources, e.g. tension, humidity and temperature of the skin, blood pressure, heart rate, and/or the like.

[0003] In the case of acoustical speech input, however, emotion detection is a very difficult problem, because the inter-speaker variance of emotional speech is very high. Therefore, evaluation of a single class of features of the speech input might not be sufficient to detect a current emotional state of a human speaker in a confident way.

[0004] It is an object of the present invention to provide a method for detecting emotions from acoustical speech in which the error of such a detection is particularly low and the detection itself is more accurate and more refined.

[0005] This object is achieved by a method for detecting emotions with the characterizing features of claim 1. The object is also achieved by a system for carrying out the method according to the features of claim 14 as well as by a computer program product according to the features of claim 15. Preferred embodiments of the inventive method for detecting emotions are within the scope of the respective dependent subclaims.

[0006] The invention is based on the finding and on the assumption that the variety of human emotions and affects can be represented in a multidimensional space, in particular a space of two dimensions, and that each of these dimensions is relevant for classification and recognition.

[0007] According to the invention, in the method for detecting emotions from speech input—in particular of at least one speaker—at least a first feature class and a second feature class of features are at least in part evaluated, derived and/or extracted from a given speech input. From said features and/or from parameters thereof a current emotional state of a current speaker and/or parameters thereof are derived. Said first and second feature classes are identified and/or associated with, in particular distinct, dimensions of an underlying emotional manifold or space and/or with subspaces thereof, in particular with activation or arousal and evaluation or pleasure, respectively.

[0008] It is therefore a key idea of the present invention to identify and/or associate the distinct first and second feature classes with dimensions of a given emotional manifold or emotional space and/or with subspaces thereof. In contrast to state-of-the-art methods for detecting emotions from speech, the present invention not only involves several feature classes but also identifies and thereby maps said feature classes to the dimensions of the emotional manifold or emotional space, giving a refinement of the description of the features and therefore a refinement of the process of detecting the emotional states of the speaker. This may be done based on the different degrees of complexity for each dimension.

[0009] According to a preferred embodiment of the present invention it is suggested to use and/or to construct for each feature class, for each dimension/subspace of said emotional manifold and/or for groups thereof in each case a separate and/or distinct specialist—in particular a subspace specialist, or the like—or a specialized classifier system. Each of said specialized classifier systems is in particular adapted to obtain, generate, evaluate and/or classify features essentially from one assigned feature class and/or from an assigned dimension or subspace of emotions. In general, there are features which are necessary for a distinct classifier only. Additionally, there might be features which are used by several or all classifiers. Here, the feature class for a given classifier refers to the complete set of features necessary for that classifier.
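
By way of illustration only, the following minimal sketch models such a subspace specialist in software: one classifier bound to the feature class of one emotional dimension. The class name, the feature-selection scheme and the degree-valued output are assumptions of this sketch, not part of the claimed method.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

@dataclass
class SubspaceSpecialist:
    """One specialized classifier, bound to a single emotional dimension."""
    dimension: str                              # e.g. "arousal" (A) or "evaluation" (E)
    feature_names: Sequence[str]                # the feature class: all features this classifier needs
    model: Callable[[Sequence[float]], float]   # maps a feature vector to a degree

    def classify(self, all_features: Dict[str, float]) -> float:
        # Select only this specialist's feature class, then return a
        # degree along its emotional dimension (e.g. in [-1, 1]).
        vector = [all_features[name] for name in self.feature_names]
        return self.model(vector)

# Hypothetical wiring: a prosody specialist for the arousal dimension.
arousal_specialist = SubspaceSpecialist(
    dimension="arousal",
    feature_names=["pitch_range", "loudness", "speaking_rate"],
    model=lambda v: sum(v) / len(v),            # placeholder model
)
```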

[0010] In a further advantageous embodiment of the inventive method for detecting emotions from speech, the distinct specialized classifier systems are applied to the different feature classes already extracted from said speech input and/or to said speech input directly. Further, the thereby evaluated, derived and/or extracted features, emotions, and/or parameters thereof, in particular from different feature subspaces, are collected and/or stored, in particular in order to obtain a final classification by combination of the results.

[0011] It is preferred that the features, emotions, and/or parameters thereof evaluated, derived and/or extracted, in particular from different feature subspaces, are combined, in particular to describe a current emotional state of the current user and/or parameters thereof.

[0012] In a further preferred embodiment of the inventive method for detecting emotions from speech, distinct specialized classifier systems and/or the features or outputs thereof are combined, fused and/or merged, in particular to form a global classifier system and/or—in the case of non-orthogonal subspaces or classifier systems—in particular by means of an empirical weighting algorithm, or the like. This is done so as to deal with and take into account common human speaking behaviour and its dependence on the underlying emotional state.
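
As a hedged illustration of such an empirical weighting algorithm, the sketch below merges the degree outputs of two specialists into a single point of the emotional space. The weight values, and the assumption that cross-talk between non-orthogonal subspaces can be captured by a fixed weight matrix, are illustrative only.

```python
from typing import Dict, Tuple

# Empirically chosen weights for non-orthogonal subspaces: each specialist
# contributes to both dimensions. All numbers are placeholders.
EMPIRICAL_WEIGHTS = {
    ("arousal_specialist", "A"): 0.9,
    ("arousal_specialist", "E"): 0.1,      # slight cross-talk into E
    ("evaluation_specialist", "A"): 0.2,   # slight cross-talk into A
    ("evaluation_specialist", "E"): 0.8,
}

def fuse(degrees: Dict[str, float]) -> Tuple[float, float]:
    """Merge specialist outputs into one (A, E) point by weighted sum."""
    a = sum(EMPIRICAL_WEIGHTS[(name, "A")] * d for name, d in degrees.items())
    e = sum(EMPIRICAL_WEIGHTS[(name, "E")] * d for name, d in degrees.items())
    return a, e

# e.g. fuse({"arousal_specialist": 0.7, "evaluation_specialist": -0.4})
```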

[0013] It is of advantage to use a class of prosodic features at least as a part of said first feature class, which is in particular at least in part identified with an activation and/or an arousal dimension of emotions or of the emotional manifold.

[0014] Alternatively or additionally, it is of further advantage to use a class of speech and/or voice quality features at least as a part of said second feature class, which is in particular at least in part identified with a pleasure and/or evaluation dimension of emotions or of the emotional manifold.

[0015] It is further preferred to use a—in particular unidimensional—classifier system of low complexity, in particular as a classifier system for prosodic features or for a prosodic feature class.

[0016] Additionally or alternatively, it is preferred to use a—in particular unidimensional—classifier system of high complexity, in particular as a classifier system for speech and/or voice quality features or for a voice and/or speech quality feature class.

[0017] Unidimensional or one-dimensional classifier systems are here referred to as classifier systems which do not mix their outputs.

[0018] In particular, said classifier system of high complexity may include a multiplicity of single classifiers, in particular by implementing speaker dependencies. These speaker dependencies may include age, gender and/or the like.

[0019] According to a further aspect of the invention, the different classifiers can give as an output not only proper emotions, but also a degree of emotions, in particular a degree of pleasure and/or activation, depending on the feature subspaces they have as an input, which can be combined afterwards to obtain a current emotional state of the speaker.

[0020] As prosodic features, pitch, pitch range, intonation attitude, loudness, speaking rate, phone duration, speech element duration features and/or the like may be used.

[0021] As speech and/or voice quality features, phonation type, articulation manner, voice timbre features, spectral tilt, amplitude difference between harmonics and formants, formant bandwidth, jitter, harmonic-to-noise ratio features and/or the like may be used.
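
Purely for illustration, the sketch below computes simplified, self-contained proxies for three of the features named above (loudness, pitch and spectral tilt). Real systems would typically use dedicated pitch trackers and voice-quality measures; all thresholds and parameters here are assumptions of this sketch.

```python
import numpy as np

def frame_loudness(frame: np.ndarray) -> float:
    """Root-mean-square energy as a crude loudness feature."""
    return float(np.sqrt(np.mean(frame ** 2)))

def frame_pitch(frame: np.ndarray, sr: int,
                fmin: float = 60.0, fmax: float = 400.0) -> float:
    """Autocorrelation pitch estimate in Hz; returns 0.0 for unvoiced frames.
    Assumes len(frame) > sr / fmin samples."""
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(corr[lo:hi]))
    return sr / lag if corr[lag] > 0.3 * corr[0] else 0.0  # 0.3: ad-hoc voicing threshold

def spectral_tilt(frame: np.ndarray, sr: int) -> float:
    """Slope of a line fitted to the log-magnitude spectrum (dB per Hz),
    a rough voice-quality proxy."""
    spectrum = np.abs(np.fft.rfft(frame)) + 1e-10  # avoid log(0)
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    slope, _ = np.polyfit(freqs, 20.0 * np.log10(spectrum), 1)
    return float(slope)
```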

[0022] It is a further aspect of the present invention to provide a system, an apparatus, a device, and/or the like for detecting emotions from input speech which is, in each case, capable of performing and/or realizing the inventive method for detecting emotions from speech input and/or its steps.

[0023] According to a further aspect of the present invention, a computer program product is provided, comprising computer program means which is adapted to perform and/or to realize the inventive method for detecting emotions from speech input and/or its steps when it is executed on a computer, a digital signal processing means and/or the like.

[0024] Further aspects of the present invention become more apparent from the following remarks:

[0025] One basic idea of the present invention is to use emotional dimensions in order to design a classifier for automatic emotion recognition or detection. The variety of human emotions and affects can be represented in a multidimensional space or manifold, in particular of two dimensions. One dimension, for example, refers to activation or arousal. The other dimension refers to evaluation or pleasure. Emotions which are placed in the same area of the emotional manifold or emotional space have similar features in terms of acoustics, and they are more difficult to classify. Therefore, the application of subspace specialists and a technique thereof based on the association of the emotional dimensions and the feature space can give better recognition rates and can lower detection errors.

[0026] Common and known emotion classification schemes use different classifiers, such as neural networks, learning vector quantization, linear discriminant analysis, CART regression trees, nearest neighbour, K-nearest neighbour, and/or the like.

[0027] Recognizing emotion from a speech signal is not an easy task. Until now, most of the known classifiers make use of prosody features or prosodic features. These prosodic features are easy to handle, but they give only information about a so-called activation or arousal dimension of emotions.

[0028] It is an aspect of the present invention to take into account at least one second dimension of the emotional space. In particular, it is suggested to evaluate the pleasure or evaluation dimension of said emotional space or emotional manifold. Such a dimension is very much influenced by quality features of the speech or the voice, i.e. by auditory features that arise from variations in the source signal and the vocal tract properties. These quality features are very speaker-dependent.

[0029] One of the main problems when designing a classifier for emotion recognition or emotion detection from speech is the fact that the same emotion can be expressed or mapped onto different features, depending on the speaker. Some speakers make changes in only one of the possible emotional dimensions in the emotional space. For speakers who make use of a multiplicity of such emotional dimensions, it is difficult to define a common range of emotions.

[0030] Assuming that a multiplicity of dimensions of the emotional space—in the above-mentioned case two dimensions are used—is relevant for an accurate emotional classification, and that those dimensions relate to different features of the speech with a different degree of complexity, a proposal according to the invention is to use both concepts in order to get an optimal design for the classifier.

[0031] One idea behind the present invention is to make use of a multiplicity of emotional dimensions in order to design a classifier for automatic emotion recognition and detection. This idea is combined with a further idea to apply a subspace specialist technique based on the association of the emotional dimensions and the feature space. Basically, it would be sufficient in accordance with one aspect of the invention to involve the application of different classifiers for each of the feature subspaces assigned to prosodic features and to quality features and to combine the results from the different classifiers.

[0032] A further aspect of the present invention is to improve this key idea by one of the following two approaches or by a combination of them.

[0033] (a) Since both dimensions of the emotional space on the basis of a two-dimensional concept relate to different features of the speech with different degrees of complexity, it makes sense to divide the problem and design two classification techniques. Each of these classification techniques looks at a subspace of the emotional manifold, i.e. at an emotional subspace, and therefore looks at different features. For the most problematic case, i.e. for the quality feature subspace, it may be intended to use more than one classifier for said subspace, and even some kind of speaker dependency may be involved, such as age, gender, and the like. A final classification algorithm will then be implemented afterwards to merge the results of the subspace specialists.

[0034] (b) On the other hand, it is rather easy to determine the degree of pleasure and activation for a given emotion. Therefore, it is possible based on this knowledge to infer the classification of such an emotion with a set of candidates. For that purpose it is necessary either to have a training database appropriately labelled with different levels of activation and pleasure, or a database labelled with emotions, and then to associate each of these emotions with fixed coordinates in both dimensions, activation and pleasure. The classification will be accomplished in terms of such levels, and there will be a mapping from certain areas of the emotional space to different emotions.
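
A minimal sketch of approach (b) follows, assuming each emotion is associated with fixed (activation, pleasure) coordinates; the coordinate values are illustrative placeholders, not taken from any labelled database.

```python
import math

# Hypothetical fixed (activation, pleasure) coordinates for some emotion
# labels; the numbers are illustrative, not values from the patent.
EMOTION_COORDINATES = {
    "sad":     (-0.6, -0.6),
    "bored":   (-0.8, -0.2),
    "relaxed": (-0.5,  0.5),
    "content": (-0.2,  0.6),
    "happy":   ( 0.5,  0.7),
    "excited": ( 0.8,  0.5),
    "afraid":  ( 0.6, -0.5),
    "angry":   ( 0.8, -0.7),
}

def nearest_emotion(activation: float, pleasure: float) -> str:
    """Map a classified (activation, pleasure) point to the closest label."""
    return min(EMOTION_COORDINATES,
               key=lambda e: math.dist((activation, pleasure),
                                       EMOTION_COORDINATES[e]))

# e.g. nearest_emotion(0.7, 0.6) -> "happy"
```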

[0035] In the following, further advantageous aspects of the present invention will be described taking reference to the accompanying Figures.

[0036] FIG. 1 is a schematic diagram showing the connection between a given emotional space and a respective feature space.

[0037] FIG. 2 is a schematic block diagram describing a preferred embodiment of the inventive method for emotion detection.

[0038] In the foregoing and in the following, for brevity, the feature classes for the distinct emotional dimensions A and E are also denoted by A and E, respectively.

[0039] In the schematic diagram of FIG. 1 an emotional space ES is given as a more or less abstract entity reflecting possible emotional states CES of a speaker per se. Each point of said emotional space ES therefore represents a possible current emotional state CES of a given speaker. By analyzing a speech input SI and extracting features f1, f2, f3 or feature values therefrom on the basis of a given set of feature classes E, A, a mapping M from the so-called feature space FS into the emotional space ES is defined. Each point FCES in the feature space FS is represented as an n-tuple <f1, f2, f3> of parameter values or feature values of extracted features f1, f2, and f3 and therefore is a parameter representation and/or approximation of a possible current emotional state CES.

[0040] The abscissa and the ordinate of the emotional space ES are assigned to the distinct feature classes E and A, whereas the axes of the feature space FS are assigned to distinct features to be extracted from a speech input SI. The values of the distinct feature parameters are determined by means of the speech input analysis. The values or degrees for the distinct emotional dimensions in the emotional space—i.e. the degrees of e.g. arousal A and evaluation E—are determined by the distinct assigned classifiers CE and CA.

[0041] In general, there are features which are necessary for CA only or for CE only. Additionally, there might be features which are used by both classifiers CE and CA. In the case of unidimensional or one-dimensional classifiers, CA and CE do not mix their outputs with respect to the dimensions A and E, respectively, i.e. CA classifies for A only, and CE classifies for E only.

[0042] Each possible emotional state CES is therefore referred to as an image obtained by the mapping M of a distinct point FCES or n-tuple of parameters in the feature space FS. The axes of the emotional space ES and therefore its dimensions E and A are assigned to the given feature classes for E and A within the emotional space ES. These dimensions define the image CES of the parameter representation FCES and therefore classify the current emotional state CES of a given current speaker as being active or passive and/or as being negative or positive.

[0043] With respect to each dimension of the emotional space ES a definite and different classifier CA, CE is applied, having as an input the corresponding feature class A, E and as an output the position of the point CES in the emotional space ES regarding the assigned axis or dimension. Therefore, within the dimensions of activation/arousal and evaluation/pleasure a given speaker might be classified as being sad, bored, content, relaxed, pleased, happy, excited, angry, afraid, and/or the like, each property being represented by distinct degrees within the corresponding emotional dimensions A and E.

[0044] FIG. 2 elucidates by means of a schematic block diagram a preferred embodiment of the inventive method for detecting emotions from a speech input. The method starts with a first, introductory step S0 in which preliminary data are provided and evaluated. In a first step S1—which might be repeated in the following—a speech input SI is received.

[0045] The method of the embodiment of FIG. 2 is mainly subdivided into a first section S10 and a second section S20, which are assigned to the evaluation of the speech input SI with respect to a first feature class assigned to a first emotional dimension of arousal/activation A and a second feature class assigned to a second emotional dimension of evaluation/pleasure in said emotional space ES, respectively. The sections S10 and S20 may be performed sequentially or in parallel, as they are essentially independent.

[0046] In the first step S11 of the first section S10, within a first feature class A of prosodic features, said prosodic features or parameters thereof are generated and extracted from the analysis of the given speech input SI. The prosodic features may comprise pitch, pitch range, loudness, speaking rate, and/or the like.

[0047] In the following step S12 of the first section S10, feature vectors are constructed from said prosodic features and mapped onto the subspace of activation/arousal according to the first feature class A to classify a passive or active state of the current speaker. For the classification of the emotional state CES of the current speaker within the subspace of activation/arousal, the classifier CA of comparatively low complexity determines the degree of arousal/activation A.
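
Such a classifier CA of deliberately low complexity could, for instance, be a fixed linear score over a few normalised prosodic features, as in the following sketch; the weights and the normalisation are illustrative assumptions, not values from the patent.

```python
def classify_arousal(pitch_range: float, loudness: float, speaking_rate: float) -> float:
    """Low-complexity classifier CA: a fixed linear score over prosodic
    features assumed to be normalised to [0, 1]; weights are illustrative."""
    score = 0.4 * pitch_range + 0.35 * loudness + 0.25 * speaking_rate
    return 2.0 * score - 1.0  # map [0, 1] onto a degree in [-1, 1]
```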

[0048] On the other hand, in the second section S20, in a first step S21, features of a second feature class E are generated which belong to a set of voice and/or speech quality features. These quality features may include spectral tilt, amplitude difference between harmonics and formants, formant bandwidth, jitter, harmonic-to-noise ratio, and/or the like.

[0049] In step S22, feature vectors are likewise constructed from these features and then mapped into the subspace or dimension of evaluation/pleasure according to the second feature class E to classify whether the current speaker is speaking negatively or positively. For the classification of the emotional state CES of the speaker in the subspace of evaluation/pleasure, the classifier CE of relatively high complexity is involved and determines the degree of evaluation/pleasure E. This complexity of the classifier CE may indeed include a multi-classifier system, speaker dependencies and/or the like.

[0050] The results coming out of the classification schemes of steps S12 and S22 may be merged and fused by a final classification algorithm according to step S30.

[0051] Finally, in step S40, a current emotional state CES of the current speaker is detected and/or output as a result of the method.

[0052] The concept of subspace specialists is essentially based on the usage of classifiers which are in each case specialized with respect to or in a certain subspace of features. The identification and assignment of feature classes with certain dimensions of the emotional space or subspaces thereof is essentially based on phonetic and phonological theory as well as on psychological and physiological studies. Any method for classifying feature vectors can be used for building said classifiers or classifier systems. These methods may include neural networks, support vector machines, Gaussian mixtures, K-nearest neighbours or the like.
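
As a sketch of this point, any off-the-shelf feature-vector classifier can play the role of a subspace specialist. The example below uses scikit-learn's k-nearest-neighbour and support-vector classifiers; the training data, feature dimensions and label sets are placeholders of this sketch only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Placeholder training data: per-utterance feature matrices for each
# subspace and discretised degree labels (all shapes hypothetical).
rng = np.random.default_rng(0)
X_prosodic = rng.random((100, 4))       # prosodic feature vectors
X_quality = rng.random((100, 6))        # voice-quality feature vectors
y_arousal = rng.integers(0, 3, 100)     # e.g. passive / neutral / active
y_evaluation = rng.integers(0, 3, 100)  # e.g. negative / neutral / positive

ca = KNeighborsClassifier(n_neighbors=5).fit(X_prosodic, y_arousal)  # low complexity
ce = SVC(probability=True).fit(X_quality, y_evaluation)              # higher complexity

a_degree = ca.predict(X_prosodic[:1])[0]
e_degree = ce.predict(X_quality[:1])[0]
```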

[0053] The combination of the results from the different classifiers or specialists for each of the feature subspaces can be done with a third, final classifier, the inputs of which can be either degrees of each dimension or emotions conditional on each dimension, and the output of which is the classified emotion.
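
Such a third, final classifier might be realized, for instance, as a simple model trained on the per-dimension degrees, as in this hedged sketch; the training data and emotion labels are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training set: per-utterance (A, E) degrees with emotion labels.
rng = np.random.default_rng(1)
degrees = rng.random((200, 2)) * 2.0 - 1.0  # columns: A degree, E degree
emotions = rng.choice(["happy", "sad", "angry", "relaxed"], 200)

final_clf = LogisticRegression(max_iter=1000).fit(degrees, emotions)
predicted = final_clf.predict([[0.7, 0.6]])[0]  # high A, high E -> e.g. "happy"
```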

1. Method for detecting emotions from speech input, wherein at least a first feature class (A) and a second feature class (E) of features are at least in part evaluated, derived and/or extracted from a given speech input (SI), wherein from said features a current emotional state (CES) of the current speaker and/or parameters (CFS) thereof are derived, and wherein said first and second feature classes (A, E) are identified and/or associated with, in particular distinct, dimensions of an underlying emotional manifold (EM) or emotional space (ES) and/or with subspaces thereof, in particular with activation or arousal and evaluation or pleasure, respectively.
2. Method according to claim 1, wherein for each feature class (A, E), for each dimension/subspace of said emotional manifold (EM) and/or for groups thereof a separate and/or distinct specialist—in particular a subspace specialist or the like—or a specialized classifier system (CA, CE) is used and/or constructed, each of which in particular being adapted to classify or obtain features essentially from an assigned feature class (A, E) and/or an assigned dimension or subspace, in particular into or for a current emotional state (CES) of the current speaker and/or parameters thereof.
3. Method according to any one of the preceding claims, wherein the distinct specialized classifier systems (CA, CE) are applied to the different feature classes (A, E) already extracted from said speech input (SI) and/or to said speech input (SI) directly, and wherein thereby evaluated, derived and/or extracted features, emotions, and/or parameters thereof, in particular from different feature subspaces, are collected and/or stored, in particular in order to obtain a final classification by combination of the results.
4. Method according to claim 3, wherein the features, emotions, and/or parameters thereof evaluated, derived and/or extracted, in particular from different feature subspaces, are combined, in particular to describe a current emotional state (CES) of the current speaker and/or parameters thereof.
5. Method according to any one of the preceding claims, wherein the distinct specialized classifier systems (CA, CE) for each feature class (A, E) and/or the features or outputs thereof are combined, fused or merged, in particular to form a global classifier system and/or—in the case of non-orthogonal subspaces or classifier systems (CA, CE)—in particular by means of an empirical weighting algorithm or the like.
6. Method according to any one of the preceding claims, wherein a class of prosodic features is used at least as a part of said first feature class (A), which is in particular at least in part identified with an activation and/or arousal dimension of emotions or of the emotional manifold (EM).
7. Method according to any one of the preceding claims, wherein a class of speech or voice quality features is used at least as a part of said second feature class (E), which is in particular at least in part identified with a pleasure and/or evaluation dimension of emotions or of the emotional manifold (EM).
8. Method according to any one of the claims 2 to 7, wherein a—in particular unidimensional—classifier system (CA) of low complexity is used, in particular as a classifier system (CA) for prosodic features.
9. Method according to any one of the claims 2 to 7, wherein a—in particular unidimensional—classifier system (CE) of high complexity is used, in particular as a classifier system (CE) for speech and/or voice quality features.
10. Method according to claim 9, wherein said classifier system (CE) of high complexity includes a multiplicity of classifiers, in particular by implementing speaker dependencies, such as age, gender and/or the like.
11. Method according to any one of the claims 8 to 10, wherein the different classifiers can give as an output not only proper emotions, but also a degree of emotions, in particular a degree of pleasure and/or activation, depending on the feature subspaces they have as an input, which can be combined afterwards to obtain a current emotional state (CES) of the speaker.
12. Method according to any one of the preceding claims, wherein pitch, pitch range, intonation attitude, loudness, speaking rate, phone duration, speech element duration features and/or the like are used as prosodic features.
13. Method according to any one of the preceding claims, wherein phonation type, articulation manner, voice timbre features, spectral tilt, amplitude difference between harmonics and formants, formant bandwidth, jitter, harmonic-to-noise ratio features, and/or the like are used as speech and/or voice quality features.
14. System for detecting emotions from speech input which is capable of performing and/or realizing a method for detecting emotions according to any one of the claims 1 to 13 and/or the steps thereof.
15. Computer program product, comprising computer program means adapted to perform and/or to realize a method for detecting emotions according to any one of the claims 1 to 13 and/or the steps thereof when it is executed on a computer, a digital signal processing means and/or the like.